Trump vs Biden 2020 Presidential Campaign Twitter Sentiment

Introduction

One of the hottest topics in 2020 was the presidential election between Donald J. Trump and former vice president Joe Biden. This election was full of firsts: it was the first election held during a worldwide pandemic, the first election to have 3 states whose margin of victory was under 1%, and the first in which the incumbent president did not concede. Because of the election's uniqueness and the highly contrasting personalities of the candidates, we decided to analyze the sentiment of each candidate's tweets to find out how much their social media attitude affected their Twitter impressions and their general approval.

In the analysis, we mapped each candidate's sentiment as a time series and measured how popular their most liked and disliked tweets were through retweets and likes. We wanted to see whether there was a correlation between negative or positive sentiment and popularity for each candidate.

The motivation for this topic came from an article published by Cambridge University Press titled "Differences in negativity bias underlie variations in political ideology". The article discusses how negative thoughts gain more attention and popularity than positive thoughts do, and how they stay in our memory for a longer period. It also links this negativity bias to politics, which gave us the idea to apply sentiment analysis to political speech. The 2020 presidential election was the perfect setting for an analysis of negativity bias in politics.

Import Resources

Read Libraries

Read Data

Web Scraping

We will get the most recent Trump and Biden tweets by web scraping Twitter, covering August 2020 through November 2, 2020.

The script below gets the 2000 most recent tweets from a Twitter account and stores them as a pkl (pickle) file. We will only run it occasionally, because Twitter limits the number of API requests we can make.

The pkl files will be named trump_tweets.pkl and biden_tweets.pkl accordingly.
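
The caching step can be sketched as follows. This is a minimal illustration, not the original script: it assumes the tweets have already been fetched as a list of dictionaries, and the helper names `save_tweets` and `load_tweets` are hypothetical.

```python
import pickle
from pathlib import Path


def save_tweets(tweets, path):
    """Serialize a list of tweet dictionaries to a pkl file."""
    with open(path, "wb") as handle:
        pickle.dump(tweets, handle)


def load_tweets(path):
    """Load cached tweets; return an empty list if no cache exists yet."""
    path = Path(path)
    if not path.exists():
        return []
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Loading from the cache first, and only calling the API when the cache is missing or stale, keeps the number of requests low.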

Import CSV files

We will import the historical Twitter data for Trump and Biden from the Kaggle repositories (up to 8/30/2020).

We renamed the CSV files to TrumpOld.csv and BidenOld.csv respectively.

Data Wrangling

Exploratory Analysis

First, we'll print out 5 tweets to evaluate their format.

Conclusion

Based on the first 5 tweets for each candidate, the text contains links, retweet identifiers (RT), and punctuation that we'll have to clean before applying NLP or sentiment analysis.

We'll also examine the "keys" for each tweet, which are the variables associated with the tweet object.

We can use Twitter's official developer documentation to examine what each variable means:
https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet

Make DataFrames

Filter out retweets from Twitter API data

According to the documentation, the data obtained from Twitter's API contains a field named retweeted_status that identifies whether or not a tweet is a retweet. We will use it to remove retweets from the pkl files.
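
A minimal sketch of this filter is shown below. It assumes each tweet is stored as a dictionary of the raw API fields and that the retweet field is only present on retweets; the function name `drop_retweets` is hypothetical.

```python
def drop_retweets(tweets):
    """Keep only original tweets.

    Assumes each tweet is a dict of raw API fields, where the
    'retweeted_status' key is present only when the tweet is a retweet.
    """
    return [tweet for tweet in tweets if "retweeted_status" not in tweet]
```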

Make DataFrames from pkl files

We'll make dataframes for Trump and Biden from the pkl files.

Merge pkl and historical csv files

We will also merge the historical tweet data with the recently scraped tweet data and ensure there are no duplicates.

We will also need to remove retweets, adjust the column order and names, reformat the date column, and only include data from 4/25/2019 onwards, as that is the date Joe Biden announced his presidential campaign.

Finally we will join the datasets together.
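
The merge could look roughly like the sketch below. The column names `date` and `text` and the function name `merge_tweets` are assumptions for illustration; the actual column names depend on the Kaggle files and the API data.

```python
import pandas as pd


def merge_tweets(historical, recent, start="2019-04-25"):
    """Stack two tweet DataFrames, keep tweets on or after `start`,
    and drop tweets whose text appears more than once.

    Assumes both frames share hypothetical 'date' and 'text' columns.
    """
    combined = pd.concat([historical, recent], ignore_index=True)
    combined["date"] = pd.to_datetime(combined["date"])
    combined = combined[combined["date"] >= pd.Timestamp(start)]
    combined = combined.drop_duplicates(subset="text", keep="first")
    return combined.sort_values("date").reset_index(drop=True)
```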

Drop Duplicate tweets

It seems like Trump had almost 200 duplicate tweets while Biden had 1.

Clean Tweet content

Remove noise

We use regex to remove words and symbols that carry no meaning in speech, such as links, retweet identifiers (RT), and punctuation.
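
A rough sketch of such a cleaning function follows. The patterns and the name `remove_noise` are illustrative assumptions, not the exact expressions used in the notebook.

```python
import re

URL_RE = re.compile(r"https?://\S+")      # links
RT_RE = re.compile(r"^RT\s+@\w+:?\s*")    # leading retweet identifier
PUNCT_RE = re.compile(r"[^\w\s]")         # punctuation and symbols such as '#'
SPACE_RE = re.compile(r"\s+")             # runs of whitespace


def remove_noise(text):
    """Strip links, a leading 'RT @user:' marker, and punctuation."""
    text = URL_RE.sub(" ", text)
    text = RT_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()
```

Note that this simple punctuation pattern also splits contractions (for example, an apostrophe becomes a space), which may or may not be desirable for the later sentiment step.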

Remove small tweets

A large portion of Trump's tweets are too short and vague to be included in the NLP analysis. Many of them simply thank people using the hashtag (#) symbol, which has already been removed. Because of tweets like these, we will remove tweets that are 5 or fewer words long.
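
As a small sketch, the length filter could be written as below, assuming words are separated by whitespace. The function name `drop_short_tweets` is hypothetical.

```python
def drop_short_tweets(texts, max_short_words=5):
    """Remove tweets with `max_short_words` or fewer whitespace-separated words."""
    return [text for text in texts if len(text.split()) > max_short_words]
```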

Remove Likes Outliers

Tweets that are outliers in terms of likes will be removed. A tweet is considered an outlier if its number of likes is more than two standard deviations away from the mean number of likes.
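
The rule above can be sketched as follows. This version assumes each tweet is a dictionary with a hypothetical `likes` key and uses the sample standard deviation; the notebook may have used a different convention.

```python
import statistics


def drop_like_outliers(tweets, n_std=2.0):
    """Drop tweets whose like count is more than `n_std` sample standard
    deviations away from the mean like count."""
    likes = [tweet["likes"] for tweet in tweets]
    if len(likes) < 2:
        return list(tweets)  # not enough data to estimate a spread
    mean = statistics.mean(likes)
    spread = statistics.stdev(likes)
    return [t for t in tweets if abs(t["likes"] - mean) <= n_std * spread]
```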

Adding Sentiment and Subjectivity

We will use the TextBlob library to assign sentiment and subjectivity to tweets.

Test TextBlob

Sentiment (polarity) is how positive or negative a sentence is, on a scale from -1 (most negative) to 1 (most positive).

Subjectivity is how much of an opinion or fact a sentence is (0 is fact, 1 is opinion).

Add sentiment and subjectivity to each tweet

Add a -1, 0, 1 label for each tweet
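
One way to derive the label from a polarity score is sketched below. The function name `sentiment_label` and the zero threshold are assumptions; a small positive threshold could be used instead to widen the neutral band.

```python
def sentiment_label(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to -1 (negative), 0 (neutral), or 1 (positive)."""
    if polarity > threshold:
        return 1
    if polarity < -threshold:
        return -1
    return 0
```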

Basic Visualization

First, we'll take a look at the average likes per week for each candidate.

As we can see, Trump receives far more likes than Biden for most of this period, which we attribute to his higher public profile and his personality. Around June 2020, Biden gets a large jump in popularity that puts him on par with Trump in terms of Twitter likes.

We can assume that this is because Trump had 4 years to become famous, or infamous, while Biden needed a year of campaigning to bring his popularity on par with Trump's. Moving forward, we will only consider tweets from June 2020 onwards.

As we can see from the chart above, both candidates' tweets are similar in their sentiment between June 2020 and November 2, 2020.

However, Trump is noticeably more extreme in his positive and negative sentiment overall while Biden is slightly more neutral.

Based on the chart above, we can see that there is little to no correlation between daily likes and sentiment for Biden.

The correlation coefficient between tweet likes and sentiment is essentially zero, which confirms this.

Based on the chart above, we can see that for Trump there is a slightly stronger association between negative sentiment and more likes.

The correlation coefficient of -0.11 confirms a slight negative correlation: as tweets get more negative in sentiment, they get more likes by a small amount.
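
For reference, a Pearson correlation coefficient like the ones quoted above can be computed as in the sketch below. The function name `pearson_r` is hypothetical, and the notebook may have used a library routine instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near 0 indicates no linear relationship, while a negative value means likes tend to rise as sentiment falls.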

From this plot, we can see that the more subjective a tweet is, the stronger its sentiment is.

NLP

Since TextBlob is not capturing Trump's speech patterns correctly, we decided to train a classifier to identify positive and negative Trump tweets. Our team manually classified around 800 Trump tweets from the latter half of 2020 to train the classifier.

Clean Data

We will use a binary Logistic Regression classifier, so we will remove neutral tweets (those with a sentiment of 0). We will also drop all NAs, which removes the tweets that we did not get to classify manually.

Tweets that are shorter than 5 words will be removed, as there is usually not enough data to correctly assign a good sentiment.

Format tweets

The cleanerNLP function will perform several actions on the tweet data to prepare it for the classifier.

Feature Engineering

To feed the cleaned tweet data to the logistic regression, we need to turn it into a "bag of words" format, so that the Logistic Regression model can treat each word as a feature. Each feature holds the number of times that word appears in a tweet.

sklearn's CountVectorizer class will be used for this. Since we already tokenized the data beforehand, we will pass a pass-through function named dummy so that the CountVectorizer skips its own preprocessing and tokenization.
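
A minimal sketch of this setup is shown below. The sample tokens are made up for illustration, and the exact arguments used in the notebook may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer


def dummy(tokens):
    """Pass-through: the tweets are already cleaned and tokenized."""
    return tokens


# Illustrative, made-up tokenized tweets (one list of tokens per tweet).
tokenized_tweets = [["great", "rally", "great"], ["fake", "news"]]

vectorizer = CountVectorizer(
    tokenizer=dummy,      # do not re-tokenize
    preprocessor=dummy,   # do not lowercase or strip anything
    token_pattern=None,   # unused when a tokenizer is supplied
)
bag_of_words = vectorizer.fit_transform(tokenized_tweets)

# Feature names ordered by their column index in the matrix.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
```

Each row of `bag_of_words` corresponds to one tweet and each column to one word in the vocabulary.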

Add the dependent variable to the Logistic Regression dataframe

Recode the dependent variable for Logistic Regression (1 = positive sentiment, 0 = negative sentiment)

Logistic Regression

In our current dataset, we have 228 negative sentiment tweets and 210 positive sentiment tweets, so the classes are split roughly 50/50 (about 52% negative and 48% positive). To be useful, the logistic regression classifier must reach an accuracy clearly above 50%; otherwise it is no better than a coin flip.

Test Train Split

60% of the data will be used for training and 40% for testing. The data will be randomly shuffled, and we'll perform stratified sampling to make sure each split has a proportional amount of negative and positive sentiment tweets.
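
The split described above could be sketched with sklearn's train_test_split as follows. The wrapper name `split_tweets` and the fixed seed are assumptions added for reproducibility.

```python
from sklearn.model_selection import train_test_split


def split_tweets(features, labels, seed=42):
    """Shuffled 60/40 train/test split, stratified on the sentiment label."""
    return train_test_split(
        features,
        labels,
        test_size=0.4,
        shuffle=True,
        stratify=labels,
        random_state=seed,
    )
```

The function returns the training features, test features, training labels, and test labels, in that order.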

Fit the model

Confusion Matrix

The model predicted Trump's sentiment with 60% accuracy. This leaves major room for improvement, but it is better than the coin-flip baseline and gives us a starting point.
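
For context, fitting the classifier and measuring its accuracy could look like the sketch below. The function name `fit_and_evaluate` and the `max_iter` setting are assumptions; the hyperparameters used in the notebook are not shown in the text.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix


def fit_and_evaluate(X_train, y_train, X_test, y_test):
    """Fit a logistic regression and report test accuracy plus a confusion
    matrix (rows = true label, columns = predicted label, order [0, 1])."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    matrix = confusion_matrix(y_test, predictions, labels=[0, 1])
    return model, accuracy, matrix
```

The confusion matrix shows whether the errors are concentrated in one class, which a single accuracy number hides.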

Future Model improvements